bad behavior
The Download: LLM confessions, and tapping into geothermal hot spots
OpenAI is testing a new way to expose the complicated processes at work inside large language models. Researchers at the company can make an LLM produce what they call a confession, in which the model explains how it carried out a task and (most of the time) owns up to any bad behavior. Figuring out why large language models do what they do--and in particular why they sometimes appear to lie, cheat, and deceive--is one of the hottest topics in AI right now. If this multitrillion-dollar technology is to be deployed as widely as its makers hope it will be, it must be made more trustworthy. OpenAI sees confessions as one step toward that goal. Sometimes geothermal hot spots are obvious, marked by geysers and hot springs on Earth's surface.
OpenAI's new confession system teaches models to be honest about bad behaviors
I guess AI gotta give part two of my confessions. OpenAI announced today that it is working on a framework that will train artificial intelligence models to acknowledge when they've engaged in undesirable behavior, an approach the team calls a confession. Since large language models are often trained to produce the response that seems to be desired, they can become increasingly likely to provide sycophancy or state hallucinations with total confidence. The new training model tries to encourage a secondary response from the model about what it did to arrive at the main answer it provides. Confessions are only judged on honesty, as opposed to the multiple factors that are used to judge main replies, such as helpfulness, accuracy and compliance.
OpenAI has trained its LLM to confess to bad behavior
Large language models often lie and cheat. We can't stop that--but we can make them own up. OpenAI is testing another new way to expose the complicated processes at work inside large language models. Researchers at the company can make an LLM produce what they call a confession, in which the model explains how it carried out a task and (most of the time) owns up to any bad behavior. Figuring out why large language models do what they do--and in particular why they sometimes appear to lie, cheat, and deceive--is one of the hottest topics in AI right now. If this multitrillion-dollar technology is to be deployed as widely as its makers hope it will be, it must be made more trustworthy.
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
Emmons, Scott, Jenner, Erik, Elson, David K., Saurous, Rif A., Rajamanoharan, Senthooran, Chen, Heng, Shafkat, Irhum, Shah, Rohin
While chain-of-thought (CoT) monitoring is an appealing AI safety defense, recent work on "unfaithfulness" has cast doubt on its reliability. These findings highlight an important failure mode, particularly when CoT acts as a post-hoc rationalization in applications like auditing for bias. However, for the distinct problem of runtime monitoring to prevent severe harm, we argue the key property is not faithfulness but monitorability. To this end, we introduce a conceptual framework distinguishing CoT-as-rationalization from CoT-as-computation. We expect that certain classes of severe harm will require complex, multi-step reasoning that necessitates CoT-as-computation. Replicating the experimental setups of prior work, we increase the difficulty of the bad behavior to enforce this necessity condition; this forces the model to expose its reasoning, making it monitorable. We then present methodology guidelines to stress-test CoT monitoring against deliberate evasion. Applying these guidelines, we find that models can learn to obscure their intentions, but only when given significant help, such as detailed human-written strategies or iterative optimization against the monitor. We conclude that, while not infallible, CoT monitoring offers a substantial layer of defense that requires active protection and continued stress-testing.
AIhub coffee corner: Bad practice in the publication world
This month we tackle the topic of bad practice in the sphere of publication. Joining the conversation this time are: Sanmay Das (Virginia Tech), Tom Dietterich (Oregon State University), Sabine Hauert (University of Bristol), and Sarit Kraus (Bar-Ilan University). Sabine Hauert: Today's topic is bad practice in the publication world. For example, people trying to cheat the review system, paper mills. What bad behaviors have you seen, and is it really a problem? Tom Dietterich: Well, I can talk about it from an arXiv point of view.
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
Li, Maximilian, Davies, Xander, Nadeau, Max
Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
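As a rough illustration of the edge-ablation idea in the abstract above, the toy sketch below zeroes out one edge at a time in a tiny hand-built graph and picks the edge whose removal most reduces the output on "bad" inputs while leaving "good" inputs nearly unchanged. Everything here is hypothetical: the component names (`tok`, `head_a`, `head_b`, `out`), the weights, the inputs, and the scoring rule are assumptions made for illustration, and the exhaustive single-edge search stands in for the learned ablation described by the authors rather than reproducing it.

```python
# Toy graph: each edge (src, dst) carries src's activation into dst with a weight.
# All names and numbers are hypothetical, chosen only to illustrate edge ablation.
EDGES = {
    ("tok", "head_a"): 1.0,
    ("tok", "head_b"): 1.0,
    ("head_a", "out"): 0.5,  # benign pathway, active on every input
    ("head_b", "out"): 2.0,  # pathway that drives the unwanted output
}


def forward(x, ablated=frozenset()):
    """Run the toy graph; any edge in `ablated` contributes zero."""
    def w(edge):
        return 0.0 if edge in ablated else EDGES[edge]

    head_a = w(("tok", "head_a")) * x
    # head_b only fires on large inputs, which play the role of "bad" inputs.
    head_b = w(("tok", "head_b")) * max(x - 1.0, 0.0)
    return w(("head_a", "out")) * head_a + w(("head_b", "out")) * head_b


def score(ablated, bad_inputs, good_inputs, penalty=1.0):
    """Lower is better: small output on bad inputs, little drift on good ones."""
    bad = sum(forward(x, ablated) for x in bad_inputs)
    drift = sum(abs(forward(x, ablated) - forward(x)) for x in good_inputs)
    return bad + penalty * drift


def best_single_edge(bad_inputs, good_inputs):
    """Exhaustively try ablating each edge and return the best-scoring one."""
    return min(EDGES, key=lambda e: score(frozenset([e]), bad_inputs, good_inputs))


if __name__ == "__main__":
    edge = best_single_edge(bad_inputs=[3.0, 4.0], good_inputs=[0.5, 1.0])
    print("ablate:", edge)
    print("bad input before/after:", forward(3.0), forward(3.0, frozenset([edge])))
    print("good input before/after:", forward(0.5), forward(0.5, frozenset([edge])))
```

In this toy setup the selected edge lies on the `head_b` pathway, so the output on large inputs drops while small inputs are unaffected. A real model has thousands of edges, which is why a learned mask is more practical than brute-force search.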
Meta's prototype moderation AI only needs a few examples of bad behavior to take action
Moderating content on today's internet is akin to a round of Whack-A-Mole, with human moderators continually forced to react in real time to changing trends, such as vaccine mis- and disinformation or intentional bad actors probing for ways around established personal conduct policies. Machine learning systems can help alleviate some of this burden by automating the policy enforcement process; however, modern AI systems often require months of lead time to properly train and deploy (time mostly spent collecting and annotating the thousands, if not millions, of necessary examples). To shorten that response time, at least to a matter of weeks rather than months, Meta's AI research group (formerly FAIR) has developed a more generalized technology, called Few-Shot Learner (FSL), that requires just a handful of specific examples in order to respond to new and emerging forms of malicious content. Few-shot learning is a relatively recent development in AI, essentially teaching the system to make accurate predictions based on a limited number of training examples -- quite the opposite of conventional supervised learning methods. For example, if you wanted to train a standard SL model to recognize pictures of rabbits, you would feed it a couple hundred thousand rabbit pictures, and then you could present it with two images and ask if they both show the same animal. Thing is, the model doesn't know if the two pictures are of rabbits because it doesn't actually know what a rabbit is.
Can AI Fortify Your Organization's Cybersecurity Strategy?
If, as we have seen in recent months, the rate of digital transformation has gone beyond anything we have seen in the past, it has also opened up many enterprises to attack in ways that have never been possible until now. Each time an organization adds new technology to the digital workplace it exposes itself to new risks. However, there are also new ways to protect their digital assets, just as there are new ways to ensure productivity. At the end of last year, Capgemini released research into how organizations are turning to artificial intelligence (AI) to protect their digital properties. Titled Reinventing Cybersecurity with Artificial Intelligence, it showed that 42% of the companies studied had seen a rise in security incidents through time-sensitive applications.
From the Field: Machine Learning and Artificial Intelligence for Malware Prevention
For many years, the main threat protection products were based on signatures. It's time to think beyond the traditional antivirus (AV). I recently participated in proof-of-concept (PoC) testing of the CylancePROTECT agent and was deeply impressed with the product's AI-driven malware prevention capabilities in comparison to more traditional approaches. The following are some key observations of the PoC outcomes. For those of you who may not know, heuristics is a technology designed to proactively detect malicious code without needing a specific signature.
CEO Behind Tinder, OkCupid on the Future of Online Dating
In her nearly 13 years at Match Group Inc., where she became chief executive in January, Ms. Ginsberg has watched the stigma of online dating fade almost entirely. Today, many people even proudly pursue a multiapp dating strategy. Match owns well-known dating apps including Tinder, Hinge and OkCupid, along with lesser-known brands such as PetPeopleMeet.com. The Dallas-based company is expanding in Latin America, Japan, South Korea and India to tap what it estimates is a market of 600 million singles. Her first year at the helm has been an eventful one. After unsuccessfully trying to acquire the dating app Bumble, Match sued its rival last spring for infringing patents for "swiping" and other features that have made Tinder popular.